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BACKGROUND OF THE INVENTION 
1. Field of the invention 

5 

This invention concerns the automatic verification of the authenticity of data printed 
onto a check. High speed, low resolution optical scanners are the means of image 
acquisition prior to analysis. As with the prior art, the basis of the method is the 
comparison of human readable information with information that is solely machine 

10 readable ('machine readable' information, data, symbols etc.) added to the check at the 
time of printing. Someone fraudulently altering the check (e.g. to change the payee 
name) may be able to quite readily alter the human readable printed information, but 
should find it far harder to alter the machine readable data to match the fraudulendy 
altered human readable data if the way that the machine readable data encodes 

1 5 information is secret and secure. Although 'human readable' information is also machine 
readable using conventional OCR, we use the term 'machine readable' to refer to 
information that is not readily and ordinarily human readable. 



2. Description of the Prior Art 

20 

There is a need to provide a cheap and rapid means of corroborating the authenticity of 
the critical, human readable data on checks in order to identify fraudulent falsification. 
Checks are the subject of high speed printing and scanning operations: operational 
constraints generally require anti-fraud techniques to integrate with existing schemes. 
25 Thus, a number of methods have been proposed in which human readable data, plus 
additional machine readable symbols or data that encode the same data as the human 
readable data, are printed onto checks. Both kinds of data are subsequently scanned and 
analysed by image sorters. 

30 Verification of checks by adding machine readable symbols has a long history. A method 
of authenticating check data was provided by Szepenski (German Patent 29 43 436 Al) 
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in 1979, although his description was not particularly concerned with the workflow issues 
associated with image sorters and the like. Text on documents in his method was to be 
authenticated by means of a machine readable pattern which contained the same 
information as the human readable text and extended over the whole document. The 
5 pattern was to contain all of the textual information and in a paper published more or 
less concurrently Szepenski suggested the use of standard error correction techniques to 
overcome the inevitable problems of accurate machine reading. 

EP 0 699 327 Bl Abathorn describes a modified version of Szepenski' s method in which 
10 the machine readable data is in the form of a bar code (or other symbology which is not 
specified) to be added to the check. This patent goes further by describing how the 
machine readable the data might be added " in a single pass through the printing system 
enabling high speed automated mass production of bearer documents." There is little 
description of the coding method but the fact that "if a user's name has been obscured, 
15 the name can be recovered if the name was selected as a value critical data item" suggests 
that the machine readable data is not hashed or encrypted. 

Ramzy, US 6,073,121 also describes a method of protecting a check by adding machine 
readable data, in this case the data comprising " all the check data" in the form of a bar 
20 code or other symbol. Ramzy differs from Abathorn in that the added data is encrypted. 
The implication is that the data retrieved from the bar code must be retrieved in its 
entirety or else it is not decipherable, and this implies that the bar code or other symbol 
must be robust against poor quality imaging. 

25 In US 6,243,480 Zhao et al authenticate check data by adding "authentication 
information" in machine readable form, this form either being a watermark or a symbol 
which could be a bar code. The authentication information described in the patent is 
some form of digest of semantic information. The digest formed from the OCR allows a 
certain amount of latitude in that commonly confused characters such as "c" and "e" are 

30 allowed to be in error without destroying the correspondence between the two versions 
of the authentication information. However, the machine readable code is such that 
corruption of it is not reversible and there is no possibility of relaxing the equality 
condition if, for instance, a bar code is damaged by a scanning problem. As is stated in 
the claims " an authentication information reader reads the first authentication 



information " and compares with data from "an authenticator that computes the 

second authentication information. " "Reading" the information as opposed to 
computing information in general allows no scope for adjusting to poor quality images. 

5 Similarly in US 6,170,744 Payformance tackles the problem of self authentication by 
including authenticating data in machine readable form. The authenticating data as above 
includes some form of digest in the form of a hash, signature or encryption and in each 
case the data is not reversible or would not be reversible if some uncorrectable reading 
error occurred. The verification is by equality of two values and has no provision for 
1 0 close misses or data adjustment. 

In US 6,233,340 Sandru describes yet another method of adding authenticating data and 
here again the data is concatenated in some way which prevents it being deciphered 
when damaged by the imaging process. 

15 

It is also well known that certain characters are easily confused by the OCR process, 
characters such as O and 0, C and O, F and E etc. Now in US 6,243,480 Mediasec) these 
sorts of characters are allowed to be considered interchangeable to reduce the OCR 
reading errors, but no information about how much confusion has occurred will be 
20 available. 

An important distinction in methods described in the prior art is between those that 
aggregate the characters in some manner and calculate a representative value and those 
that encode the characters individually. The implication is that where data has been 
25 aggregated any failure in the retrieval process may render the whole of the data invalid, 
whereas if the data is segregated, damage to parts of the data may leave the remaining 
data decipherable. 

The two commonly used forms of data aggregation are encryption and hashing. In all 
30 standard encryption algorithms, e.g DES, RSA, Blowfish, it is regarded as important that 
each bit of the plaintext affects every other bit to produce the ciphertext, this 
requirement rendering the breaking of the code much more difficult. A consequence of 
this is that alteration of any portion of the cipher text has a potential effect on every bit 
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of the plaintext. Thus if ciphertext is embedded in the machine readable code and any 
part of that code cannot be correcdy retrieved the whole of the plaintext is invalid. 

In the case of hashing, a similar situation holds. Hashing algorithms e.g MD5, SHA1 etc 
5 are designed so that hashed values which differ slighdy correspond to originals that differ 
considerably. Again, if a hash value of the text is embedded in machine readable form, 
any minor error in the subsequent reading of the text data will produce a totally different 
hash value and no information will be given about the matching of items. . 

10 In those versions of the prior art which use encryption or hashing, any misread in either 
OCR or the machine readable data will result in mismatch of the values that are required 
for authentication. The only outcome of such a comparison is agreement or non 
agreement and the level of disagreement is identical whether one or all of the original 
data characters is misread by the OCR, or whether one or all of the bits of the version of 

15 the hash value after error correction is altered. The check printing and scanning 
environment is an especially demanding one since checks are printed in large volume at 
very high speeds; scanning also operates on very high volumes of checks with relatively 
low resolution. Hence, it is an especially challenging environment for an automated 
document authentication system. 

20 

Given that for the data on checks the probability of correct automated identification of 
all of the human readable characters is at best 98 to 99%, then a huge number of checks 
will be incorrectly identified as fraudulent using conventional hashing or encryption 
based techniques. 

25 

In most methods therefore, the machine readable encoded data is some representation of 
the totality of all of the data, so that damage to a part of the representation removes the 
possibility of any meaningful data retrieval; this gready hampers the speed of the 
automatic verification of authenticity where fast, high volume printers are used to print 
30 the checks and fast, low resolution scanners are used to scan them since the authenticity 
of so many checks cannot be automatically verified. For large scale systems issuing 
several million checks a month, even a false rejection rate of 2% leads to huge numbers 
of checks that are needlessly rejected by an automated system and then have to be 
manually scmtinised for authenticity. 
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The problem with methods that have so far been proposed is that no proper account is 
taken of the degradation that may well occur to the added symbols during normal 
printing, as well as the inevitable misreading that is inherent in OCR of human readable 
5 data. Simply rejecting checks where the OCR of the human readable data does not 
identically match the retrieved machine readable data results in large numbers of 
satisfactory checks being sent for inspection, which is both costly and slow. 
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SUMMARY OF THE PRESENT INVENTION 

In a first aspect, the invention is a method of automatically verifying the authenticity of a 
printed document which includes printed human readable data and corresponding 
5 machine readable data, the method comprising the steps of: 

(a) scanning the document to generate a scanned image; 

(b) interpreting the individual characters printed as human readable data and 
interpreting the individual characters printed as machine readable data; 

(c) assessing the probability that any mismatch between the individual characters 
10 interpreted from the human readable data and the machine readable data has arisen 

through errors or artefacts introduced in printing or scanning the document and not 
deliberate falsification of the human readable data. 

The invention arises from the recognition that both human readable data printed on a 
15 check and machine readable data added to the check at the time of check printing to 
graphically encode the human readable data are subject to errors and artefacts during the 
initial printing and subsequent scanning processes: if, after scanning, there is a less than 
perfect match in the two forms of data, that does not therefore necessarily imply 
fraudulent alteration of the human readable data. The present invention enables a 
20 quantitative, probability-based interpretation of the degree and the kind of mismatch to 
verify authenticity. 



The assessed probability of mismatch arising through printer or scanner error or artefact 
25 may be a function of the quality of the scanned image; image quality can be measured as 
a function of one or more of: the lightness or darkness of the image; the contrast of the 
image; whether features of known shape in the document appear in a similar shape in the 
scanned image; the degree of adjustment required to make mismatched characters match; 
mismatch from MICR data; orientation accuracy of the scanned image. 

30 

This is valuable because as image quality deteriorates, it is very useful to be able to 
automatically relax the matching requirements between the scanned and interpreted 
human readable text and the machine readable text, since mismatches are more likely to 
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be due to errors or artefacts rather than fraudulent alteration. This relaxation of 
matching requirements can be done in several ways, such as altering a probability based 
interpretation of what the human readable data and /or machine readable data is (e.g. 
allowing a character that appears to be a V also to be a 'o' and a T). 

5 

The assessed probability may be a function of the relative position or distribution of any 
mismatches such that clustered mismatches decrease the probability that the mismatches 
arise through printer or scanner error or artefact (except in cases of localised image 
degradation identifiable by irregularities of lines, i.e. the local image quality). The 
10 assessed probability may also be a function of the font used for the machine readable 
data. 

The present invention also enables an operator to alter the probability based 
interpretations, and to alter the required degree of matching for the system to deem a 

15 check to be authenticated. This is very useful since the errors and artefacts introduced 
by printing and scanning can alter: for example, due to slight scanner lens mis-alignment, 
all scanned images produced by a particular scanner might on one particular day have a 
very high likelihood of leading to a TF being interpreted as a C N'; then, the operator can 
'tune' the system to de-sensitise it to mismatches of H and N: hence, if the human 

20 readable data is interpreted as C NUGN' but the machine readable data is interpreted as 
the name 'HUGH', the system will automatically know that the mismatch is not 
indicative of fraudulent alteration but is far more likely to be associated with scanner 
error. 

25 A function representing the probability of falsification, rather than error or artefact, can 
be empirically derived by analysing extensive manual assessments made by skilled 
operators of different kinds of mismatches. 

In an implementation, we map the probability of each member of an alphabet (e.g. letters 
30 A — Z, plus a given number range) corresponding to any feature that is identified as a 
character in the human readable data. Hence, a circular feature in the human readable 
data would have a high probability of being the letter c o', but a low probability of being 
the letter T. Similarly, we map the probability of each member of the alphabet (e.g. A — 
Z, plus a given number range) corresponding to any feature that is identified as a 



character in the machine readable data. For example, a sequence of two vertical bars 
might have a high probability of being the letter V and low probability of being the letter 
Y. This probability mapping process is done in respect of large amounts of trial data 
from large numbers of sample checks, but using the same printing and scanning 
5 equipment that would be used in practice to print and to scan real checks at high volume 
and high speed. Once the probability mapping is complete, then verification of the 
authenticity of a real check involves in essence scanning that cheque to establish if there 
is a perfect match between the scanned and interpreted human readable data and the 
scanned, interpreted machine readable data. If there is no perfect match, then, instead of 

10 rejecting the check, the automated verification process of the present invention can 
continue by measuring or obtaining (i) a probabilistic interpretation of the scanned, 
human readable data and also (ii) a probabilistic interpretation of the scanned, machine 
readable data. We then compare the two interpretations to determine if the 
correspondence satisfies a pre-defined threshold. The comparison can take as a base the 

1 5 most likely interpretation of the machine readable data; using this interpretation, we take 
the first character and compare it to each of the different possible characters occupying 
the position of first character in the human readable data. Hence, the machine readable 
data might begin with character 'H\ The first character in the human readable text might 
be an 'FT and also a 'N' at the same level of probability, and a ( M 9 at a lower level of 

20 probability. There is an identical match ( C H' in the machine readable and TT in the 
human readable and a correlation score is kept. This process continues for all characters 
and the cumulative correlation score is then compared to a threshold; if above a 
threshold, the check is passed and if below, the check is sent for further examination. 
Equally, the process can work using each of the most likely characters from the human 

25 readable data as a base and correlating each of these to the possible interpretations of 
each machine readable character. 

If the correlation satisfies the pre-defined threshold, then we accept that the machine 
readable data is sufficiently close to the human readable data for the check to be regarded 
30 as authentic. We have in effect subtracted out the effect of predefined printing and 
scanning errors and artefacts so that these do not lead to erroneous 'false positives' — i.e. 
incorrect indications that a cheque is inauthentic when it in fact is authentic. If the 
correspondence does not satisfy the pre-defined threshold, then the check is submitted 
to more detailed scrutiny. 
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In the context of high speed check printing and scanning, being able to model and 
subtract out the effects of normal printing and scanning errors and artefacts enables a 
very significant reduction in false positives — checks that have to be submitted for further 
5 scrutiny but turn out to be authentic. 

In more general terms, the form of coding for the machine readable data is made to 
depend upon the characteristics of the human readable text and its retrievability and 
comprises independent segments that allow for partial recovery despite localised 
10 degradation. The analyses of the human readable text and machine readable code are 
mutually dependent and, together with external data, provide a probability model for the 
detection of possible fraudulent checks. 

The comparison of the probability based interpretations can use a metric specifically 
15 tailored to one or more of: printer performance; scanner performance; image quality; 
operator assigned rules. Similarly, the first probability based interpretation and the 
second probability based interpretation themselves can use a metric specifically tailored 
to one or more of: printer performance; scanner performance; image quality; operator 
assigned rules. 

20 

Also, the threshold can be varied by an operator depending on one or more of printer 
performance; scanner performance; image quality; operator assigned rules. 

As described above, the present invention requires the comparison of data printed in at 
25 least two different forms on a document, such as a check. The two different forms may 
be a human readable form and a machine readable form. The documents are scanned at 
the time of authentication and the images are analysed to allow a probabilistic 
comparison to be made. Each form of data appearing on the document will require its 
own algorithm to retrieve the encoded data. This algorithm may be a form of OCR, or a 
30 bar code interpreter, or a customised interpreter for special forms of encoding such as 
the 'Seal encoding' which is part of one implementation of the present invention; 'Seal' 
encoding is described in more detail in PCT/GB02/00539, which is incorporated by 
reference herein. 
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The data that is added usually originates in the form of a string of alphanumeric 
characters that may be part or the whole of the data on the document. In the case of 
checks, the data that is embedded could be any selection of the variable data, as opposed 
to the check stock data. This variable data includes payee, amount, account number, date, 
5 bank routing number and data unique to a particular bank. 

In the traditional printing of checks, the added data simply appears in text form and this 
will also be the case in the main implementation of this invention. Thus, the added data 
comprises a set of distinguishable characters. This data or a subset or digest of this data 

10 is added in a form that is machine readable and generally not human readable. The 
machine readable data is embedded in discrete segments so that if one segment is 
damaged the remainder may still be valid and able to give information about the 
likelihood of deliberate falsification. Conventional hashing or encryption based 
techniques cannot meaningfully assess the extent of a mismatch between human readable 

15 text and machine readable text and hence inevitably lead to large numbers of false 
positives. 

The encoding of a given character that might appear in both the human readable text and 
the machine readable text is such that the chance of inaccurately interpreting the 

20 character when in the human readable data as a different character is inversely 
proportional to the chance of inaccurately interpreting the same character as the same 
different character when in the machine readable data. Hence, actual individual coding 
of these characters is such that their chance of confusion in the machine readable code is 
inversely proportional to their chance of confusion by the OCR methods. That is to say, 

25 one form of data embedding is a function of the retrieval probabilities of another form 
of data embedding. 

An implementation of this invention deploys a function which predicts the probability of 
deliberate falsification, as opposed to misreading, by constructing the data retrieval 
30 process to return information about the nature of any errors. Thus the probability of 
deliberate falsification will be a function of the measured quality of the image, the 
machine readable code and the human readable data, measured by the fact that these 
entities give clear, unambiguous symbols or are difficult to resolve. The probability of 
deliberate falsification will also be a function of such parameters as the relative 
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position/distribution of mismatches, e.g. of erroneously detected characters, having 
regard to the fact that falsification usually involves a coherent set of contiguous 
characters rather than randomly separated characters. 

5 

BRIEF DESCRIPTION OF THE DRAWINGS 

The invention will be described with reference to Figure 1(a) to (d), which show how a 
seal that graphically encodes data might appear when scanned and interpreted; Figure 2 
10 which schematically depicts how the probability assessment of whether a mismatch is 
fraudulent or not varies depending on image quality and letter distribution. 
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DETAILED IMPLEMENTATION 

Workflow 

Many large corporations print their own checks in a bulk processing environment using 
5 high speed printers, usually laser printers. The usual method is to have check stock 
preprinted with general information about the Bank to whom the check must ultimately 
be presented for encashment, its routing number and similar data which is common to 
thousands of checks. The individualised data required before issuance of the check 
includes payee name, account number, date, amount of transaction etc. and this is usually 
1 0 added by a laser printer. 

In the present invention, a 'seal' or other machine readable code is printed at the same 
time as the individualised variable data is added. This is generally achieved by adding an 
image of the seal to the printing file before it is despatched to the printer, but it can be 
1 5 equally well achieved by modifying the PCL commands or the use of soft fonts if they 
are the means that will best accord with the running of the system. PCT/GB02/00539 
may be referenced for further details about 'Seal' encoding. 

In the normal cycle of the check, the payee pays in the check to the "Bank of First 
20 Deposit" or to a check cashing oudet. At this point the human readable data is read and 
possibly a cash payment takes place. In one implementation of this invention, the seal 
containing authenticating data will also be read using a simple desk top scanner or 
equivalent check reader. 

25 The check is then forwarded to the issuing Bank or financial services company acting for 
the Bank. High speed scanners are used to capture images of both the front and back of 
the check in a bulk processing mode with minimal human intervention. The analysis of 
data and reconciliation of the checks then takes place using the images. In accordance 
with this invention, the data on any seals will be read and analysed either at this point or 

30 as part of an offline process. At this point also, any checks which do not meet acceptance 
criteria, perhaps on account of being damaged or unreadable or because two forms of 
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data do not match, will be identified as exceptions and be subjected to further 
examination. 

Data for Machine Readable Code 

5 

The variable data that is printed on the checks just prior to issuance includes the payee 
name, the amount, the account number, the date. The amount and account number are 
also included in the MICR line printed at the bottom of the check and which provides 
another machine readable source of data. 

10 

In the proposed implementations, a seal containing at least two of these entities in 
machine readable form will be printed onto the check. We encode algebraically human 
readable data from the check, where the data is in the form of characters from a known 
alphabet, convert the algebraic information into a graphical form, and then print the 
15 graphical form onto the check at the same time as the human readable data is printed. 
The form of coding of the machine readable graphic is dependent upon the 
characteristics and retrievability of the human readable form. 

20 The data is in the form of alphanumeric characters which are converted to binary strings 
before being represented in graphical form. An important feature of the conversion to 
binary format is the fact that each string consists of independent but interleaved 
segments, each segment representing a character or small group of characters. Thus if 
10 letters were to be converted to a binary string, each letter might be represented by 16 

25 bits and these bits might be interspersed in a string of 160 bits according to some rule. If 
one of the letters were to be changed only the 16 corresponding bits would be changed 
and there would be no knock on effect on the rest. Similarly, if 1 bit were to be changed 
only one of the letters would be affected. 

30 The manner of representation of characters in binary form is a key part of this 
implementation. In many applications, the codes representing characters are generated 
using an established error coding technique. Often used are cyclic codes on account of 
their structure which lends itself to easy decoding. In the case of this invention, there is 
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no need for highly structured codes because the chunks of data to be decoded are small 
enough to be handled by cruder methods. The main requirement is that the "Hamming 
Distances" (HD) between codes should be chosen so as to best reflect the quality of 
information derivable from the scanned images. 

The HD between two codes of equal bit length is simply the number of bit positions in 
which the codes differ. Thus if 2 codes have a large HD they are unlikely to be confused 
unless there is a large number of bit errors. The penalty for making HD's too large is that 
the codes become too long and occupy too much of the available payload: The HD 
between binary representations of a pair of characters will be greatest for those pairs 
which are least likely to be easily differentiated by an OCR method. 

The factors affecting the HDs are, according to this invention: 

(a) The quality of the images of the seals. 

In most implementations there will be many checks available to standardise data 
and find expected values for any quality measurements. The quality of the images 
is a function of the resolution of the scanners, their quality in terms of tendency to 
merge distinct features or produce artefacts and any issues arising from the rapid 
transport of checks through the processing system. The quality is also a function of 
the consistency of the printing method and such matters as level of toner within a 
printer. This quality has to determine the overall distribution of HD's of any set of 
codes, ensuring that the likelihood of a misread is at a satisfactory level. Thus if 
image quality is very poor the number of bits in the codes will be increased to 
allow a greater error margin. 

(b) The accuracy of any OCR reader 

The number of errors produced by the OCR reader should give an additional guide 
to the accuracy required of a Seal and hence the overall distribution of HDs. 
In addition HDs should be adjusted to take into account the fact that some 
characters are far more likely to be confused by OCR than others. "O" for instance 
is frequently mistaken for "C" but "Z" is rarely mistaken for "I". To cope with 
this property of OCR the HD's between the code for "O" and code for "C" will 
tend to be larger than between those for "Z" and "I." Thus although the OCR 
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may tend to confuse "O" and "C", the Seal reading would be highly unlikely to do 
so. 

5 With these considerations in mind a set of codes can be generated to represent the 
characters and hence convert the human readable text into a binary string. 

Representation of Data in Machine Readable Form 

10 In a preferred embodiment the form in which the data is added is that described in 
detail in the Bitmorph patent PCT/GB02/00539. 

In an alternative embodiment, the data is added in the form of a two dimensional bar 
code. 

15 

Analysis of Seal 

The scanners provide images of checks, generally in black white, for the purposes of 
analysis. A further source of data may be from the reading of the MICR line by a device 
which reads magnetic ink. 

20 

Where there is machine readable code such as that produced by bar codes, glyphs or 
Seals there are many well described techniques to orientate and scale images prior to 
analysis of individual code bearing symbols. For the purposes of this description it will 
be assumed that the analysis can be taken to the level where the information is contained 
25 in a set of graphics, each graphic being a cell containing a configuration of black and 
white pixels which is to be interpreted. 

Thus where glyphs are used the cells will typically be squares containing black pixels 
which in the original image formed a diagonal stripe, the orientation of the stripe 
30 indicating whether the symbol is to be counted as a "1" or a "0." This configuration will 
be modified by the printing and scanning processes so that what was originally a sharp 
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clear line will become a more irregular feature. The task of the decoder is to interpret 
whether such a feature was meant to be a forward or backward sloping diagonal. 

Similarly if a seal is used, the cells will be of a variety of shapes and will contain 
5 configurations that may originally be vertical or horizontal lines but in the scanned 
images will appear as more diffuse shapes. 

In two dimensional bar codes, the cells will typically be rectangles each containing 4 
black rectangular segments and 4 white spaces in the original form, but after scanning 
1 0 will contain irregularities. 

It is one of the purposes of this invention to assess any of these forms of machine 
readable code for the level of image quality degradation and provide a representative 
quality statistic. By empirically analysing the distribution of this statistic for a large 
1 5 number of checks and associating the quality statistic with the number of errors that is 
produced in the corresponding decoding process, a prediction of likely errors /arte facts 
for a given image with measured quality parameters may be produced. In this way, one 
can assess the probability of mismatch between human readable and machine readable 
data arising through printer or scanner errors /arte facts and not deliberate falsification. 

20 

For glyphs and seals, a set of graphics will correspond to a binary string representing a 
single character. For instance, each of 40 characters may be represented by 16 bits with 
HDs chosen appropriately, in other words 16 graphics go to make up a single character. 
The analysis will allocate to each graphic a "1" or "0" to correspond to the binary string. 
25 In many cases there will be several of the bits interpreted wrongly. If the number of 
errors is within the bounds that the error correction can rectify, the character that is 
allocated will be that whose binary string has the smallest HD from the interpreted 
graphics. 

30 In some implementations instead of allocating one of two possible values to a graphic, a 
range of values will be allocated. A number 100 might indicate, for example, a perfect 
vertical stripe, whilst —100 might indicate a perfect horizontal stripe. A value of +50 
would correspond to a vertical stripe with some extra artefacts. Figure 1 shows a set of 
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graphics before and after scanning with a set of values allocated according to the 
closeness to a vertical or horizontal stripe. 

Calculation of HD is modified thus. A binary code such as 1110 0011 1100 1110 is 
5 allocated 16 values by replacing each "1" by "100" and replacing each "0" by —100. 

Thus the code becomes 

{100,100,100,-100, -100,-100,100,100, 100,100,-100,-100, 100,100,100,-100} 

1 0 The set of scanned graphics corresponding to a character might become, for example, 
{ 80, 70, 70,- 20 , -30, -50, -10, -20, 70, 90, -90,-50 50, 60, 50. 0} 

The HD of this set of 16 graphics from the code would be the sum of the differences for 
each of the 16 components. That is: 
1 5 HD between the scanned code and the tested character 

= 20 + 30 + 30 +80 +70 + 50 + 110 +120 + 30 + 10 + 10 + 50 + 50 +40 +50 
+100 
= 850. 

20 The same calculation would be carried out for each of the codes and the code with the 
smallest HD would be presumed to correspond to the original machine readable data. 

Each set of graphics will be tested against the chosen vocabulary or alphabet of 
characters. In each case there will be an adjustment (corresponding to the value 850 
25 above) needed to match a given scanned code to one of the vocabulary codes. The sum 
of the adjustments gives another metric for comparing the quality of the scanned image. 

The calculation just described is a non-limiting example of a further aspect of the 
invention. The decoding of the seal gives a most probable set of values for the 
30 characters. In addition the decoding of the seal allows the allocation of probabilities to 
one character rather than another. Thus, if for a set of 16 graphics the HD from the 
letter "A" were to be 800 and the HD from the letter "B" were to be 850 there would be 
quite a high probability that if an "A" appeared where a "B" was expected then this was 
due to reading error rather than deliberate falsification. 
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Optical Character Recognition (OCR) 

The variable data, in particular the payee name and the amount are read automatically 
5 from the scanned images by one of the many available OCR software applications. 

In a preferred implementation of this invention, the OCR application reads the human 
readable characters on the check and attributes a probability to some or all of the 
characters in the selected alphabet or vocabulary. In general, the probabilities are only 
10 relevant for two or three characters whose shape most nearly approximates the scanned 
in figure. 

In another implementation, the characters that are read from the Seal are passed to the 
OCR application. The application then considers each supposed character and attributes 
15 a probability to the hypothesis that the character read by the OCR is indeed the one 
proposed by the Seal. This process of verification may thus accept as correct a letter that 
a normal OCR process might reject; OCR might suggest an 'E' where this form of 
verification might accept that the real character was an e F* corrupted by the presence of a 
horizontal line produced by the rapid movement of the cheque across the scanner. 

20 Combining OCR Data and Seal Data 

From the foregoing it can be seen that after the Seal reading and the OCR there will be 
two sets of data which must be compared to authenticate the check in question. 

25 If the OCR data is identical to the Seal data then the check is accepted as authentic. If 
one or more characters differ then an assessment has to be made as to the cause and the 
recommended action. 

In one implementation the assessment might be as follows. 

30 

First, a measure is taken of the degree of difference between the OCR data and the Seal 
data. This might be measured by a metric such as the Levenstein distance which takes 
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into account characters that are substituted, omitted or inserted, or, more appropriately 
by a metric that is specially tailored to match the known attributes of the system (e.g. 
printer attributes and performance; scanner attributes and performance; image quality; 
operator assigned rules etc). The metric will include recognition of the close similarity 
5 between certain pairs of characters. Thus if a Z appeared where an I were expected a 
distance of 1.0 might be ascribed, but if a O appeared in place of an 0 a distance of 0.2 
might be ascribed. 

This metric also takes into account the possible misreads in the Seal where probabilities 
10 can be attached through knowledge of the HDs between characters. 

Modification of the measured distance can result from assessment of the significance of 
the positions in the text in which differences occur. If, for instance, three unmatched 
characters were randomly distributed through the payee text then it is less likely to be the 
1 5 result of deliberate falsification. 

Analysis of the image can be carried out to identify artefacts that have been produced by 
the scanning process. Such artefacts are often easily recognised as arising from the 
movement of the check. A further quality factor is the darkness of the image which 
20 depends both on the amount of toner added at the time of printing and the threshold 
value of the scanner. 

The extent to which the quality factors affect the Seal and OCR is assessed empirically by 
sampling large numbers of checks. This sampling will provide an ongoing 
25 standardisation. 

The overall result is a metric for the difference between Seal and OCR data that is 
dependent on environmental factors, methods selected for coding and means of 
interpreting code in graphic form. 

30 

In one implementation the MICR information on the check is read and compared with 
the supposedly identical information in the Seal. The accuracy or otherwise of this 
comparison is an indicator of the quality of the Seal data, particularly because the MICR 
information is read to a high degree of accuracy. 
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Once the difference between Seal data and OCR data has been calculated a threshold has 
to be decided upon so that checks on one side of the threshold are further examined to 
see if they might be counterfeit. The level of the threshold depends upon the penalties 
5 for false positives and the known likelihood of counterfeits. 

The following is a non limiting example of how the invention might figure in an image 
enables cheque environment typical of the situation arising from implementation of the 
Check 21 Act. Images are scanned at sorters in a central clearing operation. 

10 

In a preoperational pilot scheme, a set of typical cheque images with known text is 
collected for analysis. The cheques will have been subjected to the typical degradation 
that might occur to genuine circulated cheques. An OCR engine is used to read various 
types of the known text data including Payee Name, Amount and Courtesy Amount. The 
15 number and type of reading errors are assessed as a function of: 

(a) image quality as measured by heaviness or lightness of image (usually a function 
of scanners rather than printers,) contrast levels if greyscale, presence of streaks 
(typical artefacts of high speed scanning), accuracy of orientation. 

(b) type of font, for instance, a lower case serif font at no more than 10 point will 
20 have a higher error likelihood than a non serif font in upper case. 

(c) particular characteristics of printing quality from specific accounts. 

Another set of cheques is printed with machine readable symbols of the type to be used, 
unencoded but with known values. Again these cheques are degraded in a typical fashion 
25 and scanned on standard cheque sorters, the images being used for analysis. The 
probability of misread is measured, again as a function of image quality. 

A set of cheques with some degree of error is presented to human operators whose task 
it is to decide whether the error would be regarded as a likely indicator of fraudulent 
30 falsification or as an insignificant typographic change. From this will be derived an 
algorithm that attaches probabilities to various types of discrepancy. Perhaps, for 
instance, one or two isolated letters changed may be regarded as likely typographic 
errors, whereas a group of incorrect adjacent letters would be a cause for further 
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inspection. The decisions will be based on the knowledge of the types of falsification that 
characterise deliberate fraud. 

From these pilot investigations a verification scheme will be constructed. The encoding 
5 for the machine readable code will include a level of error correction that will achieve a 
selected threshold of error, maybe 99.5%. The payload on a check is limited, particularly 
by the resolution of the scanners and so error correction must have a finite limit. The 
probability of occurrence of fraud is a known distribution and an algorithm exists which 
combined with the above probabilities provides a rule for selecting likely exceptions. 

10 The probability is assessed with reference to the distribution of errors within the text. 
This is illustrated in Figure 2, which shows how the relative probability of ana ccidental 
misread as compared to fraudulent alteration varies depending on image quality and also 
letter distribution. For, the likelihood of an accidental misread has to be set higher for a 
low quality image as compared to a high quality image. At a given quality, clustering of 

1 5 mismatched letters leads to a higher likelihood of fraudulent alteration. 

When the cheques with the agreed machine coding are issued and subsequendy returned 
to the clearing sorters, the images produced are analysed. If the text and machine 
readable code are read as agreeing, then the cheque is accepted. If there is a mismatch 
20 then analysis based on the above probability functions takes place as below. 

First, the image quality is measured. If the human readable data is read as textH and the 
machine readable quality is read as textM, the probability that textH is a misread of textM 
is calculated using the probability as a function of image quality and the other factors 

25 cited above. If, for instance, in a poor quality image an *0' is read as a C C the probability 
of this being accidental is high. The probability that textM is a misread of textH is then 
considered, again using the probability as a function of image quality and the level of 
difference between the coded forms of textH and textM. The combined probabilities 
give the probability of accidental error. This probability is then compared with the rules 

30 deduced from the human operators where the probability of a particular type of error is 
assessed as a likely indicator of fraudulent alteration. 

The probabilities in this scheme are continually updated by accumulation of information 
about image quality and levels of fraud. 
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1. A method of automatically verifying the authenticity of a printed document 
which includes printed human readable data and corresponding machine readable data, 
the method comprising the steps of: 

(a) scanning the document to generate a scanned image; 

(b) interpreting the individual characters printed as human readable data and 
interpreting the individual characters printed as machine readable data; 

(c) assessing the probability that any mismatch between the individual characters 
interpreted from the human readable data and the machine readable data has arisen 
through errors or artefacts introduced in printing or scanning the document and not 
deliberate falsification of the human readable data. 

2. The method of Claim 1 in which individual characters are encoded in the 
machine readable data. 

3. The method of Claim 2 in which the form of encoding deployed for the machine 
readable data is a function of the encoding used to construct the human readable data. 

4. The method of Claim 3 in which the encoding of a given character that might 
appear in both the human readable text and the machine readable text is such that the 
chance of inaccurately interpreting the character when in the human readable data as a 
different character is inversely proportional to the chance of inaccurately interpreting the 
same character as the same different character when in the machine readable data. 

5. The method of any preceding Claim in which the assessed probability of 
mismatch arising through printer or scanner error or artefact is a function of the quality 
of the scanned image. 

6. The method of Claim 5 in which the assessed probability is increased as image 
quality decreases. 

7. The method of Claim 5 or 6 in which image quality is measured as a function of 
one or more of: the lightness or darkness of the image; the contrast of the image; 
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whether features of known shape in the document appear in a similar shape in the 
scanned image; the degree of adjustment required to make mismatched characters match; 
mismatch from MICR data; orientation accuracy of the scanned image. 

8. The method of Claim 5 in which the assessed probability is a function of the 
relative position or distribution of any mismatches such that clustered mismatches 
decrease the probability that the mismatches arise through printer or scanner error or 
artefact. 

9. The method of any preceding Claim in which the assessed probability is a 
function of the font used for the machine readable data. 

10. The method of any preceding Claims 5 - 9 in which the assessed probability is a 
function of rules specified by a human operator or empirically derived by analysing 
extensive manual assessments made by skilled operators of different kinds of 
mismatches. 

1 1 . The method of any preceding Claim comprising the steps of: 

(a) establishing a first probability based interpretation of the human readable text; 

(b) establishing a second probability based interpretation of the machine readable 
text; 

(c) assessing the probability by comparing the probability based interpretations. 

12. The method of Claim 11 in which the first probability based interpretation and 
the second probability based interpretation uses a metric specifically tailored to one or 
more of: printer performance; scanner performance; image quality; operator assigned 
rules. 

13. The method of Claim 12 in which the assessment of the probability uses a metric 
specifically tailored to one or more of: printer performance; scanner performance; image 
quality; operator assigned rules. 

14. The method of any preceding Claim in which the document is not submitted to 
further scrutiny if the assessment of probability is above a predefined threshold. 
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15. The method of any preceding Claim in which the document is submitted to 
further scrutiny if the assessment of probability is below a predefined threshold. 

5 16. The method of Claim 14 or 15 in which the threshold can be varied by an 
operator depending on one or more of printer performance; scanner performance; image 
quality; operator assigned rules. 

17. The method of any preceding Claim comprising the steps of: 

10 (a) Encoding algebraically human readable data from a check, where the data is in 
the form of characters from a known alphabet, converting the algebraic information into 
a machine readable graphical form, printing the graphical form onto the check at the 
same time as the human readable data is printed; 
(b) Scanning the said check; 

15 (c) Reading the human readable data using an OCR scheme which allocates 
probabilities of each member of the alphabet corresponding to any feature identified as a 
character; 

(d) Reading the machine readable data and allocating probabilities of each member 
of the alphabet corresponding to any feature identified as a character in the machine 

20 readable data; 

(e) Comparing the resulting sets of probabilities and establishing an overall 
probability that any mismatch is due to reading error rather than deliberate falsification. 

18. The method of Claim 17 where the form of coding of the machine readable 
25 graphical is dependent upon the characteristics and retrievability of the human readable 

data. 

19. The method of Claim 18 where the Hamming distance between binary 
representations of a pair of characters will be greatest for those pairs which are least 

30 likely to be easily differentiated by an OCR method. 

20. The method of Claim 17 where the machine readable data consists of 
independent segments that enable recovery of partial information when there is localised 
degradation. 
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21. The method of claim 17 where the analysis of the machine readable data provides 
a measure of the degradation of the image of the check and this measure in turn and 
assists in the attribution of probabilities to the likelihood of a mismatch arising through 

5 printer or scanner errors or artefacts and not deliberate falsification. 

22. The method of claim 17 where the degradation of the human readable data is 
assessed by image processing methods and assists in the attribution of probabilities to 
the likelihood of a mismatch arising through printer or scanner errors or artefacts and 

1 0 not deliberate falsification. 

23. The method of claim 17 where the probability of occurrence of fraud is a known 
distribution and an algorithm exists which combined with the above probabilities 
provides a rule for selecting likely exceptions. 

15 

24. The method of claim 23 where the probability is assessed with reference to the 
distribution of errors within the text. 

25. The method of claim 17 where the set of elements that make up a character in 
20 the representation in graphical form are distributed throughout that form so as to survive 

moderate localised degradation. 

26. A document that has been subject to automatic verification using the method as 
defined in any preceding Claim 1 — 25. 

25 

27. The document of Claim 26, which is a check. 



ABSTRACT 



26 



VERIFICATION OF AUTHENTICITY OF CHECK DATA 



5 The invention recognises that both human readable data printed on a check and machine 
readable data added to the check at the time of check printing to graphically encode the 
human readable data are subject to errors and artefacts during the initial printing and 
subsequent scanning processes: if, after scanning, there is a less than perfect match in 
the two forms of data, that does not therefore necessarily imply fraudulent alteration of 
1 0 the human readable data. The present invention enables a quantitative, probability-based 
interpretation of the degree and the kind of mismatch to verify authenticity. 
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Figure 1 (a) Part of a Seal (5x4 cells) 

(the outlines of the cells shown here in grey do not appear in the actual Seal) 



Figure 1 (b) Values attached to each graphic 
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Figure 1 (c) Scan of above part of Seal 




Figure 1 (d) Values attached to the scanned graphic in 1 (c) 
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Figure 2 

Set of outcomes with 3 mismatched letters with high quality image 
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Set of outcomes with 3 mismatched letters with low quality image 
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